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1 Introduction 



The need for estimating large covariance matrices arises naturally in many scientific 
applications. For example in bioinformatics, clustering of genes using genes expression 
data in a microarray experiment; or in finance, when seeking a mean-variance efficient 
portfolio from a universe of stocks. One common feature is that the dimension of the 
data Pn is usually large compare with the sample size n, or even pn ^ n (genes expression 
data, fMRI data, financial data, among many others). The sample covariance matrix S 
is well-known to be ill-conditioned in such cases. Even for S = / the identity matrix, the 
eigenvalues of S are more spread out around 1 asymptotically as Pn/n gets larger (the 
Marcenko-Pastur law, Marcenko and Pastur, 1967). It is singular when pn > n, thus not 
allowing an estimate of the inverse of the covariance matrix, which is needed in many 
multivariate statistical procedures like the linear discriminant analysis (LDA), regression 
for multivariate normal data, Gaussian graphical models or portfolio allocations. Hence 
alternatives are needed for more accurate and useful estimation of covariance matrix. 

One regularization approach is penalization, which is the main focus of this paper. 
Sparse estimation of the precision matrix fl = has been investigated by many re- 
searchers, which is very useful in Gaussian graphical models or covariance selection for 
naturally ordered data (e.g. longitudinal data, see Diggle and Verbyla (1998)). Mein- 
shausen and Biihlmann (2006) used the Li-penalized likelihood to choose suitable neigh- 
borhood for a Gaussian graph and showed that Pn can grow arbitrarily fast with n for 
consistent estimation, while Li and Gui (2006) considered updating the off-diagonal ele- 
ments of fl by penalizing on the negative gradient of the log-likelihood with respect to 
these elements. Banerjee, d'Aspremont and El Ghaoui (2006) and Yuan and Lin (2007) 
used Li-penalty to directly penalize on the elements of f2, and develop different semi- 
definite programming algorithms to achieve sparsity of the inverse. Friedman, Hastie and 
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Tibshirani (2007) and Rothman et al. (2007) considered maximizing the Li-penalized 
Gaussian log-likelihood on the off-diagonal elements of the precision matrix 17, where the 
Graphical LASSO and the SPICE algorithms are proposed respectively in their papers 
for finding a solution, and the latter proved Frobenius and operator norms convergence 
results for the final estimators. 

Pourahmadi (1999) proposed the modified Cholesky decomposition (MOD) which fa- 
cilitates greatly the sparse estimation of Q, through penalization. The idea is to decompose 
S such that for zero-mean data y = (t/i, ■ ■ ■ , yp„)^, we have for z = 2, ■ ■ ■ ,pn, 



where T is the unique unit lower triangular matrix with ones on its diagonal and («, j)**^ 
element —(pij for j < i, and D is diagonal with i^^ element af = var(ej). The optimization 
problem is unconstrained (since the (pi/s are free variables), and the estimate for Q, is 
always positive-definite. With MOD in ( 11. ip . Huang et al. (2006) used the Li-penalty on 
the 0ij's and optimized a penalized Gaussian log-likelihood through a proposed iterative 
scheme, with the case Pn < n considered. Levina, Rothman and Zhu (2007) proposed 
a novel penalty called the nested LASSO to achieve a flexible banded structure of T, 
and demonstrated by simulations that normality of data is not necessary, with pn > n 
considered. 

For estimating the precision matrix Q, for naturally ordered data, apart from the 
nested LASSO, Bickel and Levina (2008) proposed banding the Cholesky factor T in 
(11. ip . with the banding order k chosen by minimizing a resampling-based estimation of 
a suitable risk measure. The method works on estimating a covariance matrix as well. 
While these two methods are simple to use, they cannot eliminate blocks of weak signals 
in between stronger signals. For instance, consider a time series model 



i-l 




(1.1) 



yi = 0.7?/i_i + 0.3yi_3 + 
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which corresponds to (11.11) with {/)j^2 = 0, = for j > 4. For example, this kind of 
model can arise in clinical trials data, where response on a drug for patients follows a 
certain kind of autoregressive process with weak signals preceding stronger ones. This 
implies a banded Cholesky factor T, with the first and third off-diagonal bands being 
non-zero and zero otherwise. Banding and nested LASSO can band the Cholesky factor 
T starting from the fourth off-diagonal band, but cannot set the second off-diagonal band 
to zero. And if these methods choose to set the second off-diagonal band to zero, then 
the third non-zero off-diagonal band will be wrongly set to zero. Both failures can lead 
to inaccurate analysis or prediction, in particular the maximum eigenvalue of a precision 
matrix can then be estimated very wrongly. Clearly, an alternative method is required 
in this situation. We present the block penalization framework in the next section and 
more motivations and details of the methodology. 

For more references. Smith and Kohn (2002) used a hierarchical Bayesian model to 
identify the zeros in the Cholesky factor T of the MCD. Fan, Fan and Lv (2007), using 
factor analysis, developed high- dimensional estimators for both S and S"^. Wu and 
Pourahmadi (2003) proposed a banded estimator through smoothing of the lower off- 
diagonal bands of T obtained from the sample covariance matrix (implicitly, Pn < n). 
Then an order for banding of T is chosen by using AIC penalty of normal likelihood 
of data. Furrer and Bengtsson (2007) considered gradually shrinking the off-diagonal 
bands' elements of the sample covariance matrix towards zero. Bickel and Levina (2007) 
and El Karoui (2007) proposed the use of entry-wise thresholding to achieve sparsity in 
covariance matrices estimation, and proved various asymptotic results, while Rothman, 
Levina and Zhu (2008) generalizes these results to a class of shrinkage operators which 
includes many commonly used penalty functions. Wagaman and Levina (2007) developed 
an algorithm for finding a meaningful ordering of variables using a manifold projection 
technique called the Isomap, so that existing method like banding can be applied. 
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The rest of the paper is organized as follows. In section [2|, we introduce the model for 
block penalization, and the motivation behind. A notion of sign-consistency, we name 
it block sign-consistency, is introduced. Together with asymptotic normality, we call it 
the oracle property of the resulting one-step estimator. An initial estimator needed for 
the one-step estimator, with the block zero-consistency concept, is introduced in section 
12.51 A practical algorithm is discussed, with simulations and real data analysis in section 
131 Theorems [2](i) and [3] are proved in the Appendix, whereas Theorems El^ii) and H] are 
proved in the Supplement. 

2 Block Penalization Framework 
2.1 Motivation 

For data with a natural ordering of the variables, e.g. longitudinal data, or data with a 
metric equipped like spatial data with Euclidean distance, if data points are remote in 
time or space, they are likely to have weak or no correlation. Then T in equation (11.11) . 
and thus fl, are banded. Banding and nested LASSO mentioned in section [T] are based 
on this observation for obtaining a banded structure of the Cholesky factor T. See Figure 
(U^b) for a picture of a banded Cholesky factor. 

Also, for variables within a close neighborhood, the dependence structure should be 
similar. Equation (11.11) then says that coefficients on an off-diagonal band of the Cholesky 
factor T are close to neighboring coefficients (see also Wu and Pourahmadi (2003)). This 
means that we can improve our estimation if we can efficiently use neighborhood infor- 
mation (along an off-diagonal band of T) to estimate the values of individual coefficients. 

With these insights, we are motivated to use the block penalization method. In the 
context of wavelet coefficients estimation, Cai (1999) introduced a James-Stein shrinkage 



5 



rule over a block of coefficients, whereas Antoniadis and Fan (2001, page 966) were 
the ffist to point out that such method can be regarded as a special kind of penalized 
likelihood which penalizes on the L2 norm of a group of coefficients, and introduced 
a separable block-penalized least squares for simple solutions. Both papers argue that 
block thresholding helps pull information from neighboring empirical wavelet coefficients, 
thus increasing the information available for estimating coefficients within a block. Yuan 
and Lin (2006) introduced the same method, which they called the group LASSO, to 
select grouped variables (factors) in multi-factor ANOVA and compare grouped version 
of LARS and LASSO. Zhou, Rocha and Yu (2007) further introduced a penalty called 
the Composite Absolute Penalty (CAP) to introduce grouping and a hierarchy at the 
same time for the estimated parameters in a linear model. 

Block penalization allows for a flexible banded structure in T since zero off-diagonal 
bands can precede the non-zero ones. This is an advantage over banding of Bickel and 
Levina (2008) and nested LASSO of Levina et al. (2007) as discussed in section [H 
Moreover, the block sign-consistency property in Theorem[2t^i) implies a banded estimated 
Cholesky factor T if the truth Tq is banded. See Figure [U for a demonstration. 

■ ■■■ 

(a) (h) (c) (d) 

Figure 1: Pattern of zeros in the resulting estimator for T using (a)Block Penalization; 
(b)Bandzng; (c)Nested LASSO; (d)LASSO 
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2.2 Block penalization 

As pointed out in Levina et al. (2007), the MCD in (11.11) does not require the normahty 
assumption of the data, and they introduce a least squares version for their penahzation. 
We also use such an approach, and define 

n Pn 
i=l j=2 

with y^y] = {ya, ■■■ , Vij-if, </>„ = {4>^i2], " " " , 0p„[p„])^, and 0^-^] = (0^-1, ■ ■ ■ , (pjj-if. 

When px„{-) is singular at the origin, the term-by-term penalty X]r=2 X]j=iPA„(|0ij|) 
has its singularities located at each = 0, and the block penalty 

^(0n)= El>A„,(||^,||), (2.2) 

i=i 

has its singularities located at £j = for j = 1, ■ ■ ■ ,p„ — 1, where Xnj = Xn{Pn — jY^"^, 
ij = {4>j+i,i, 4>j+2,2, ■ ■ ■ y 4>pn,pn-j)'^ is the off-diagonal band of the Cholesky factor T 
in (II. ip . and || ■ || is the L2 vector norm. Hence this block penalty either kills off a whole 
off-diagonal band £j or keeps it entirely (see also Antoniadis and Fan (2001)). 
Combining (12. ip and (12.21) is the block-penalized least squares 

g„(</)J = L„(<^J+nJ(0„). (2.3) 

We will use the SCAD penalty function for p\{-) in (12. 2p . defined through its derivative 

p\ie) = Al|e<A} + (aX - ^) + l{,>A}. (2.4) 

SCAD penalty is an unbiased penalty function which has theoretical advantages over 
Li-penalty (LASSO). See Lam and Fan (2007) for more details. In fact, in Fan, Feng and 
Wu (2007), the SCAD-penalized estimate of a graphical model is substantially sparser 
than the Li-penalized one, which has spuriously large number of edges, partially due to 
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the bias induced by Li-penalty and hence requiring a smaller A that induces spurious 
edges. With we estimate D in (11 .ip by 

n n 

= ^^1' = ^'^ ^iy^j - yf\j]^my^ j = 2, 3, ■ ■ ■ (2.5) 

i=l i=l 

2.3 Linearizing the SCAD penalty 

Minimizing Qn{4>n) (12.31) poses some challenges. Firstly, Qn{4>n) is not separable, 
which makes our problem computationally challenging. Secondly, the SCAD penalty 
complicates the computations as there are no easy simplifications of the problem like 
equation (5) in Antoniadis and Fan (2001, page 966). 

Zou and Li (2007) showed that linearizing the SCAD penalty leads to efficient algo- 
rithms like the LARS to be applicable, and that sparseness, unbiasedness and continuity 
of the estimators continue to hold (see Fan and Li (2001)). Following their idea, we 
linearize each Pa„j in (12.21) at an initial value ll-^j^"*!! so that minimizing (12. 3p is 

equivalent to minimizing, for = 0, 

n Pn P71-1 

Ql'^ (0n,) = E T.(y^^ - + ^ E p'k. (Ii4'^ id ii^. ii ' (2-6) 

i=l j=2 j=l 

where we denote the resulting estimate by <pl^~^^\ Parallel to Theorem 1 and Proposition 
1 of Zou and Li (2007), we state the following theorem concerning convergence in iterating 
(12. 6p starting from k = 0. 

Theorem 1 For k = 0,1,2, ■■■ , the ascent property holds for Qn w.r.t. {<pl^^}, i.e. 

Furthermore, let (t)n^^^ = M{(j)l^^), so that M is the map carrying (f)^^'' to <pl^'^^^ . If 
Qn{(pn) = Qn{,M{(f)^)) Only for stationary points of Qn and if 4>n is a limit point of the 
sequence {4>l^^}, then 0* is a stationary point Qn. 
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This convergence result follows from more general convergence results for MM (minorize- 
maximize) algorithms. Hence starting from an initial value 4>n\ able to it- 

erate (12.61) to find a stationary point of Qn- Note that even starting from the most 
primitive initial value 0^^] = 0, the first step gives a group LASSO estimator since 
Pa„j(0) — ^nj = ^niPn — jY^"^ ■ Heuce the second step gives a biased reduced estimator 
of LASSO, as Pa^^ dl-^j'^^ II) — ^"^^ ll-^j^^ll > o.^nj- In section l275l we show how to find a 
good initial estimator which is theoretically sound, and iterating until convergence is not 
always needed. 

2.4 One-Step Estimator for (p^ 

We now develop a one-step estimator to reduce the computational burden and prove that 
such an estimator enjoys the oracle property in Theorem [2l The performance of this 
one-step estimator depends on the initial estimator (f)^^^ . Define, for £jQ denoting the true 
value of £j in T, 

JnO = {j ■ ijO = 0}, Jnl = {j ■ ijO 7^ 0}. 

Definition 1 An initial estimator 0^°^ is called block zero- consistent if there exists 7„, = 
0(1) such that (a) P(maXjgj^g ll'^j'^'' ll/(Pn — jY^"^ > In) ^0 as n ^ oo, and (b) for the 
same 7„, P(min,ej„, ||£f||/(p„ _ j)i/2 > ^ i. 

This definition is similar to the idea of zero-consistency introduced in Huang, Ma 
and Zhang (2006), but we now define it at the block level, which concerns the average 
magnitude of each element in the off-diagonal £^^\ With this, we present the main 
theorem of this section, the oracle property for the one-step estimator. 

Theorem 2 Assume regularity conditions (A) - (E) in the Appendix, and the Cholesky 
factor To of the true precision matrix VLq has kn < n non-zero off-diagonal bands. If the 
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initial estimator (fr^' for Qn in i\2. 6\) is block zero- consistent, then the resulting estimator 
4>n by minimizing l{2.6\) satisfies the following: 

(i) (Block sign-consistency) P{A fl 5) — 1, where A = {ij = for all j G Jno}, O'nd 
B = {sgn(4+fc,fc) = sgn(0°+fc ) for all j G J„i, k so that 0°+^ ;u ^ 0}. 

(a) (Asymptotic normality) Let (f)^^^ be the vector of elements of (p^^ corresponding to 
its non-zero off- diagonals. Then for a vector a.n of the same size as (p^i so that otn 
has at most kn non-zero elements and ||q:„|| = 1, if k^(iog^(kn + l)Y^'^/n = o(l), 
we have 

where H„ is block diagonal with pn — 1 blocks. Its (j — l)-th block is (t|oS~]^\, and Sjn = 
-E(yi[j](l)yi[j](l)^), where yi[j](l) contains the elements ofyty] corresponding to the non- 
zero off- diagonals' elements of (ji^yy 

From this theorem and regularity condition (C) in the Appendix, the size p„ of the 
covariance matrix can be larger than n. In particular, if kn is finite, the oracle property 
still holds when logp„ = o{n). This is useful for many applications with p„ > n, when 
the sample covariance matrix becomes singular, whereas Theorem [3] shows that as long 
as the Cholesky factor is sparse enough, we can get an optimal estimator of the precision 
matrix via penalization. 

Theorem 3 Let T be the one-step estimator as in Theorem 0, and D be diagonal with 
elements a'j as defined in \2.5) . so that = T-^D^-'^T. Then under regularity conditions 
(A) - (E) in the Appendix, with fio denoting the true precision matrix, 

- flolloc = Op{{kn + lf'\\ogpJnY/^), 
\\h - floll = Op{{kn + lf'\\ogprJnfl^), 
where ||M||oo = maxjj and ||M|| = Amax(M'^M). 
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We will demonstrate related numerical results in section [31 From this theorem, the 
method of block penalization allows for consistent precision matrix estimation as long 
as the cost of dimensionality logpn satisfies {kn + l)^logp„/n = o(l). In particular, if 
kn is finite, we only need logp„/n = o(l) for consistent estimation. On the other hand, 
provided the cost of dimensionality is not too large (e.g. p„ = n°' for some a > 0, so 
logp„ = a log 77, and is negligible), we need kn = o(n^/^) for element-wise consistency. 

2.5 Block zero-consistent initial estimator 

We need a block zero-consistent initial estimator for finding an oracle one-step estimator 
in the sense of Theorem [21 The next theorem shows that the OLS estimator T, where 
the sample covariance matrix is S = T^^D(T^^)"^ using the MCD in (11. ip . is block zero- 
consistent when Pn/n const. < 1. When p,„ > n, S is singular and T is not defined 
uniquely. Since we envisage a banded true Cholesky factor Tq with most non-zero off- 
diagonals close to the diagonal, we define T by considering the least square estimators of 
the regression 

i-l 

yi= ^ (l>i,jyj + e*, (2.7) 

where Cni = max{[2 — 7nJ,l} with some constant < 7 < 1 controlling the number 
of ?/j's on which rji regresses. The rest of the (pij^s are set to zero, recalling that even 
starting from the most primitive initial value (pjy] = O5 the one-step estimator is a group 
LASSO estimator since p^^^(O) = \nj = \n{Pn — JY^"^- 

Theorem 4 Assume regularity conditions (A) to (E) in the Appendix. Then the estima- 
tor T obtained through the above series of regressions is block zero-consistent, provided 
all the true non-zero off-diagonal bands o/Tq are within the first \_^n\ off-diagonal bands 
from the main diagonal o/Tq. 
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Remark : In high dimensional model selection, the condition of "irrepresent ability" 
from Zhao and Yu (2006), "weak partial orthogonality" from Huang et al. (2006) or the 
UUP condition from Candes and Tao (2007) all describe the need of a weak association 
between the relevant covariates and the irrelevant ones under the true model, for the 
estimation procedures to pick up the correct sparse signals asymptotically. In our case, 
with (11.11) as the true model, the association between the variables yi and yi, - ■ ■ ,yi-i 
for i = 2, ■■■ ,pn is incorporated into the tail assumption of the yi/s, which is specified 
in regularity condition (A). This assumption entails that the | Fiji's for i and j far apart 
are small, so that the association between the relevant y^s (corr. to (f)t,i 7^ 0) and the 
irrelevant y/s (corr. to (ptj = 0) in model (II. ip are small. 

In practice, for the series of regression described, we can continue to regress y^ on the 
next \ffn\ yj^s etc until all the 0jj's are obtained. We adapt this initial estimator in the 
numerical studies in section [3l 

Also in practice, the rate at which maxjgj^^ 11-^^°'' ll/(Pn — jY^"^ converges to zero in 
probability in definition [1] may not be fast enough for the OLS estimators. One way 
to improve the quality of the OLS estimators is to smooth along the off-diagonals of 
T. For instance, Wu and Pourahmadi (2003) smoothed along off-diagonals of the OLS 
estimator T to reduce estimation errors. This amounts to assuming that the coefficients 
= fj,p,X'i/Pn), where fj,p„{-) is a smooth function defined on [0, 1]. We then calculate 
the smoothed coefficients 

Pn-j 

<Pj+k,k = ^ Wj{r +j,k+ j)(f)j+r,r, 
r=l 

where the weights wjlr + j,k + j) depends on the smoothing method. We use local 
polynomial smoothing with bandwidth h ^ oo with h/pn — ^ 0, so that var(0j_|_fc^fc) = 
0{n~^h~^) (See Wu and Pourahmadi (2003) and Fan and Zhang (2000) for more details.). 
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2.6 Algorithm for practical implementation 

Yuan and Lin (2006) proposed a group LASSO algorithm to solve problems similar to 
(12. 6p . However, when pn is large, the algorithm is computationally very expensive. In- 
stead, we adapt an idea from Kim, Kim and Kim (2006) and use a gradient projection 
method to solve for the one-step estimator, which is computationally much less demand- 
ing. Since minimizing (12.61) can be considered as a weighted block-penalized least squares 
problem with weights w^j = np')^^.{\\£^^''^\\)/Xn, it can be formulated as: 

minimizing Ln{4>n) subject to < M (2.8) 

i=i 

for some M > 0. Since the further off-diagonal bands of T are too short, in practice 
we stack them together until it is of length of order We then treat it as one block 
in the above dual-like problem, and denote by s„ the number of off-diagonals in T after 
stacking. 

Assume for now that all the tuning parameters are known. Starting from an initial 
value and t = 1, the gradient projection method involves computing the gradient 
S/ Ln{4>n~^'^) defining b = 0^*"^'' — sVL„(0^*~^^), where s is the stepsize of iterations 
to be found in the next section. Denote by b(j) the jth block of b, with blocks formed 
according to the off-diagonals of T, j = 1, ■ ■ ■ , s„. Then the main step of the algorithm 
is to solve 

Sn 

4>i = argmin<^^gg||b - <pj^, with = { XI ^njUiW ^ ^^}' 
which is called the projection step. It can be easily reformulated as solving 

minX(||b(j)|| - MjY subject to ^w^^Mj < M, Mj > 0, (2.9) 

where then = Mjb(j)/||b(j)||, and we iterate the above until convergence. Standard 
LARS or LASSO packages can solve (12. 9p easily, but we adapt a projection algorithm 
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by Kim et al. (2006) which can solve the above even faster. In solving (12. 9p . we are 



Mj > 0. The key observation is that if such projection has non-positive values on some 
Mj's, then the solution to (12. 9p should have those iWj's exactly equal zero. Hence we 
can then recalculate the projection onto the reduced hyperplane until no more negative 
values occur in the projection, and it is easy to see that at most s„ such iterations are 
needed to solve (12. 9p . In detail, we start at r = {1, ■ ■ ■ , and calculate the projection 



for j = 1, ■ ■ ■ , Sn- We then update r = {j : Mj > 0} and calculate the above projection 
again until Mj > for all j. 

2.7 Choice of tuning parameters 

There are three tuning parameters introduced in the previous section, namely A„, M and 
s. The small number s is a parameter for the gradient projection algorithm and it is re- 
quired that s < 2/L, where L is the Lipchitz constant of the gradient of Ln{4>n)- can be 
easily shown that L = 2XUL{S^), where Sy = diag(^"^^ yi[2]yf[2]^ ■ ■ ■ , EHi yi[Pn]yI\p„]), 

so that S < Amax^(5'y). 

For the choice of M, note that for a suitable A„ and that £j = £jQ in (12.81) . we either 
have w'^j = or ijQ = 0. Thus, the value of J2^j=i ^njll-^ioll is always zero. In view of this, 
the oracle choice of M is actually zero. We adapt this choice in the numerical studies in 
section [3l 

For the choice of A„, we use a GCV criterion similar to the one used by Kim et al. 
(2006). We find T as defined in section [23| and smooth the off-diagonal bands of T to 



form T. Define W, = diag(i/;i„/||4J|lj:.„, ■ ■ ■ , i/^^^/ll^sll, </||^i||) 



and Xj = (yi[j], y2[j], ■ ■ ■ ,yn\j])'^, where 1^ denote the column vector of ones of length 



essentially projecting (||b(i) 



b(s„)||) onto the hyperplane Zljli 



^jMj = M with 




(2.10) 
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m. The GCV-type criterion is to minimize 

«°^''»' - A. („-tr|x,(xJx, + A„w,)-.x7i)^- t^'") 

where tr(-) denotes the trace of a square matrix. In practice we calculate GCV(A„) on a 
grid of values of A„ and find the one that minimizes GCV(A„) as the solution. 



3 Simulations and Data Analysis 

In this section, we compare the performance of block penalization (BP) to other regular- 
ization methods, in particular banding of Bickel and Levina (2008) and LASSO of Huang 
et al. (2006). 

For measuring performance, the Kullback-Leibler loss for a precision matrix is used. 
It has been used in Levina et al. (2007), defined as 

Lkl{^, S) = tr(S~'s) - log |S~'S| - p„, 

which is the entropy loss but with the role of covariance matrix and its inverse switched. 
See Levina et al. (2007) for more details of the loss function. We also evaluate the 
operator norm ||r2 — fioH for different methods to illustrate the results in Theorem [3] in 
our simulation studies. The proportions of correct zeros and non-zeros in the estimators 
for the Cholesky factors are reported. 

3.1 Simulation analysis 

The following three covariance matrices are considered in our simulation studies. 
L El = 0.8/. 

II. S2 : = 0i,i_2 = -0.6, 0i,i_4 = 0i,i_6 = -0.4, = otherwise; cr|o = 0.8. 
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III. S3:</.ij = 0.5^-^j<z; a% = 0.1. 

The covariance matrix Si is a constant multiple of the identity matrix, which is considered 
by Huang et al. (2006) and Levina et al. (2007). S2 is the covariance matrix of an AR(6) 
process, which has a banded inverse. S3 is the covariance matrix of an MA(1) process. 
It is itself tri-diagonal and has a non-sparse inverse. We investigate the performance of 
BP in such a non-sparse case. 

Regularity conditions (B) to (E) are satisfied for the three models by construction. 
Since all three define stationary time series models in the sense of (II. ip . condition (A) is 
satisfied from Gaussian to general Weibull-distributed innovations. 

We generated n = 100 observations for each simulation run, and considered p„ = 
50, 100 and 200. We used = 50 simulation runs throughout. In order to illustrate 
theoretical results and test the robustness of the BP method on heavy-tailed data, on top 
of multivariate normal for the variables, we also consider the multivariate ^3 for the vari- 
ables, which violated condition (A). Tuning parameters for the LASSO and banding are 
computed using 5-fold CV, while the parameter for the BP is obtained by minimizing 
GCV(A„) in (12. lip . We set the smoothing parameter h = 0.3 for local linear smoothing 
along the off-diagonal bands for demonstration purpose. The constant 7 and stacking 
parameter Sn mentioned in section 12.51 are set at 0.9 and pn — \2pn ] respectively. In 
fact we have done simulations (not shown) showing that smoothing along off-diagonals 
for the initial estimator can improve the performance of the one-step estimator. All the 
results below for the performance of BP are based on such smoothed initial estimators. 
Also, all subsequent tables show the median of the 50 simulation runs, and the number 
in the bracket is the SDmad which is a robust estimate of the standard deviation, defined 
by the interquartile range divided by 1.349. 

Not shown here, we have carried out comparisons between using GCV-based and 5- 
fold CV-based tuning parameter A.„ for the BP method, and both performed similarly. 
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Table 1: Kullback-Leibler loss for multivariate normal and simulations. 







Multivariate normal 


Multivariate is 






Pn 


LASSO 


Banding 


BP 


LASSO 


Banding 


BP 


Si 


100 


l.O(.l) 


1.1(.8) 


l.O(.l) 


7.7(3.8) 


10.7(9.3) 


7.8(3.9) 




200 


2.1(.2) 


2.4(3.4) 


2.1(.2) 


16.4(9.7) 


22.9(18.8) 


16.4(9.7) 




100 


27.2(1.4) 


11.1(6.5) 


5.6(.5) 


110.7(29.2) 


57.7(21.1) 


28.2(10.6) 




200 


264.6(39.9) 


20.4(12.3) 


11.5(.7) 


789.5(132.0) 


101.6(36.0) 


54.7(14.2) 


S3 


100 


8.8(.7) 


7.8(9.7) 


4.3(2.0) 


40.2(7.6) 


31.8(14.9) 


19.8(7.9) 




200 


19.4(1.5) 


24.9(83.4) 


18.1(23.1) 


99.6(23.6) 


70.3(35.4) 


56.3(26.0) 



However, the GCV-based method is much quicker, and hence results of simulations are 
presented with the GCV-based BP method only. 

Table [H shows the Kullback-Leibler loss from various methods for multivariate normal 
and ^3 simulations. We omit the case for = 50 to save space, but results are similar to 
those for higher dimensions. In general the higher the dimension, the larger the loss is for 
all the methods. On Si, all methods perform similarly as expected (sample covariance 
matrix performs much worse and is not shown). However on S2, BP performs much 
better for all p„ considered, especially when multivariate is concerned. The better 
performance is expected, since BP can eliminate weaker signals that precede stronger 
ones, but not particularly so for other methods. On S3, BP performs slightly better 
on average, particularly for multivariate ^3 simulations. For normal data, LASSO has 
smaller variability, though. 

To demonstrate results of Theorem [3], the operator norm of difference ||r2 — f2o|| for 
different methods are summarized in Table [2l Clearly BP performs better in comparison 
with LASSO and banding on S2, in both normal and ^3 innovations. The performance 
gap gets larger as p„ increases. For S3 BP still outperforms the other two methods in 
general, especially for heavy-tailed data. 
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Table 2: Operator norm of difference ||r2 — f2o|| for different methods. 







Multivariate normal 




Multivariate ^3 






Pn 


LASSO 


Banding 


BP 


LASSO 


Banding 


BP 


Si 


100 


•6(.l) 


.7(.3) 


■6(.l) 


1.7(.5) 


2.0(.8) 


1.7(.5) 




200 


.7(.l) 


.8(.4) 


.7(.l) 


1.8(.6) 


2.0(.9) 


1.8(.5) 


S2 


100 


5.9(.4) 


6.2(3.5) 


2.5(.4) 


11.3(4.6) 


11.0(6.6) 


7.2(3.5) 




200 


29.1(11.3) 


5.7(3.4) 


2.6(.4) 


58.1(11.2) 


12.1(5.7) 


7.7(2.3) 


S3 


100 


14.7(1.6) 


19.0(14.2) 


11.6(1.9) 


40.3(9.1) 


33.8(13.5) 


28.1(6.6) 




200 


16.0(1.4) 


27.4(63.7) 


18.4(6.1) 


46.1(6.0) 


42.2(17.3) 


35.5(11.0) 



Finally, to illustrate the ability to capture sparsity, we focus on S2 and summarize 
the correct percentages of zeros and non-zeros estimated in Table [31 BP almost gets all 
the zeros and non-zeros right in all simulations. The LASSO does poorly in the correct 
percentages of zeros. This is due to biases induced by LASSO that require a relatively 
small A, resulting in many spurious non-zero coefficients. The banding method does not 
work well too. However, note that both banding and BP do better as dimension increases. 



Table 3: Correct zeros and non-zeros(%) in the estimated Cholesky factors for S2. 



Multivariate normal 




Multivariate t 


3 




Pn 


LASSO 


Banding 


BP 


LASSO 


Banding 


BP 


Correct 


50 


60.6(2.3) 


73.5(20.1) 


100(0) 


56.5(3.5) 


89.1(12.3) 


95.6(14.0) 


percentage 


100 


75.3(.9) 


87.7(12.0) 


100(0) 


70.5(2.6) 


94.4(5.8) 


100(0) 


of zeros 


200 


73.5(.7) 


92.9(8.7) 


100(0) 


72.0(.7) 


97.3(2.7) 


100(0) 


Correct 


50 


99.6(.4) 


100(0) 


100(0) 


96.4(1.6) 


71.3(35.0) 


100(0) 


percentage 


100 


99.2(.3) 


100(0) 


100(0) 


95.1(1.8) 


72.3(33.3) 


100(0) 


of non-zeros 


200 


99.3(.3) 


100(0) 


100(0) 


97.1(.7) 


80.5(25.9) 


100(0) 
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3.2 Real data analysis 

We analyze the call center data using the BP method. This set of data is described in 
detail and analyzed by Shen and Huang (2005), and we thank you for the data courtesy 
by the authors. 

The original data consists of details of every call to a call center of a major northeastern 
U.S. financial firm in 2002. Removing calls from weekends, holidays, and days when 
recording equipment was faulty, we obtain data from 239 days. On each of these days, 
the call center open from 7am to midnight, so there is a 17- hour period for calls each 
day. For ease of comparison, following Huang et al. (2006) and Bickel and Levina (2008), 
we use the data which is divided into 10-minute intervals, and the number of calls in 
each interval is denoted by A^jj, for days i = 1, ■ ■ ■ , 239 and interval j = 1, ■ ■ ■ , 102. The 
transformation yij = {Nij + 1/4)^/^ is used to make the data closer to normal. 




50 55 60 65 70 75 80 85 90 95 100 



Time Interval 

Figure 2: Mean absolute forecast errors for different estimation methods. Average is taken 
over 34 days of test data from November to December, 2002. 
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The goal is to forecast the counts of arrival calls in the second half of the day from those 
in the first half of the day. If we assume = {yn, ■ ■ ■ , ^4,102)"^ ~ ^(m? S), partitioning 
Yi into yf ^ and yf ^ where yf ^ = {ya, ■ ■ ■ , yi,5iV, yf ^ = (2/^,52, ■ ■ ■ , ^^,102)^, and denoting 



M = 





1' 


f Sn 


S12 \ 


V J 






S22 / 



the best mean square error forecast is then given by the conditional mean 

This is also the best mean square error linear predictor without normality assumption. 

To compare performance of different estimators of S, we divide the data into a training 
set (Jan. to Oct., 205 days) and a test set (Nov. and Dec, 34 days). We estimate 
ft = X]i=iy«/205, and S by sample covariance, banding and BP. For each time interval 
j = 52, ■ ■ ■ , 102, we consider the mean absolute forecast error 

^ 239 

^^^j' = 34 XI 

1=206 

For BP, we use GCV with h = 0.1. The number k = 19 for banding is used in Bickel and 
Levina (2008). From Figure [21 it is clear that the BP outperforms the other two methods, 
in particular for the time intervals 66 to 75 corresponding to the mid-afternoon. 

Appendix: Proof of Theorems [2](i) and [3] 

We state the following general regularity conditions for the results in section [2l 

(A) The data y^, i = 1, 2, ■ ■ ■ , n are i.i.d. with mean zero and variance So? a symmetric 
positive-definite matrix of size Pn- The tail probability of y^ satisfies, for j = 
1,2, ■■ ■ ,pn, P{\yij\ > x) < K exp{—Cx'^), where d > and C, K are constants. 
The innovations ei2, ■ ■ ■ , eip„ for i = I,-- - ,n in (II. ip are mutually independent 
zero-mean r.v.'s and var(ejj) = (T|g, having tail probabihty bounds similar to the 
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(B) The variance-covariance matrix Sq in (A) has eigenvalues uniformly bounded away 
from and oo w.r.t. n. That is, there exists constants Ci and C2 such that 

< Ci < Amm(So) < Amax(So) < C2 < OO for all n, 

where Amin(So) and Amax(So) are the minimum and maximum eigenvalues of Sq 
respectively. 

(C) Let dni = min{0°]^j- : > 0}, where is the j-th element of 4>^i (see Step 2.1 
in the proof of Theorem [2](i) for a definition). Then as n — > oo, 

fen log Pn , Q k^logpn ^ ^ \og Pn , ^ 

nci^i ' ?^A„ ' nA2 

(D) The tuning parameter A„ satisfies 

< A„ < min , ^^^^"L/p , 
with [pn — j) ^ oo for all j G J„i as n — > oo. 

(E) The values cr^^ = maxi<t<p^ a^Q and a^^y^ = maxi<r<p„ var(yj>) are bounded uni- 
formly away from zero and infinity. 

The following lemma is a direct consequence of Theorem 5.11 of Bai and Silverstein 
(2006). 

Lemma 1 Let {yi}i<i<n be a random sample ofn vectors with length qn, each with mean 
and covariance matrix S . In addition, each element of ji has finite fourth moment. 
Then ifqn/n ^ i < 1, the sample covariance matrixSn = XlILiy^yf satisfies, almost 
surely, 

lim A^ax(S„) < A^ax(S)(l + Vif, lim A^in(S„) > A^i„(S)(l - Vly. 
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Proof of Lemma [T] By Theorem 5.11 of Bai and Silverstein (2006), the matrix S* = 
I]~^^^S„I]~^^^ which is the sample covariance matrix of I]~^^^yj, has 

lim A„,ax(S:) = (1 + v^)^ lim A^i„(S:) = (1 - ^f^f 

n— ►oo n— ►oo 

almost surely. Since £ < 1, this implies that S* is almost surely invertible. Then by 
standard arguments, 

lim A^i,(S„) = lim A^i„(Si/2g*5.i/2) > A^i,(S)(l - Vif 

n— »oo n— »oo 

almost surely. The other inequality is proved similarly. □ 

Proof of Theorem [2l The idea is to prove that the probability of a sufficient condition 
for block-sign consistency approaches 1 as n — s> cx). We split the proof into multiple steps 
and substeps to enhance readability. We prove for the case /cn > 1 first, with the case 
= put at the end of the proof. 

Step 1. Sufficient condition for solution to exist. An elementwise sufficient condi- 
tion, derived from the Karush-Kuhn- Tucker (KKT) condition for to be a solution to 
minimizing (12. 6p (see for example Yuan and Lin (2006) for the full KKT condition), is 

n 

2^yi^t_j{yit - yj[t]^t[t]) = ^nw'^j4)t,t-j/\\ij\\, for all Ij ^ 0, (A.l) 



1=1 

n 



t[t]' 



i=l 



< \nwUpn - jT'^', for all I, = 0, (A.2) 



where t = j + 1, ■ ■ ■ and w^^ = np'^^^.^li^j^^lD/Xn (see section 1221 for more definitions). 
We assume WLOG that the kn non-zero off-diagonals of the true Cholesky factor Tq are 
its first kn off-diagonals to simplify notations. We also assume no stacking (see section 
12. 6p of the last off-diagonal bands of T in solving (12.61) : the case of stacked off-diagonals 
can be treated similarly. 
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Step 2. Sufficient condition for block sign- consistency. To introduce the sufficient 
condition for block-sign consistency, we define Ctjk = XlILi yiM(j)y«M(^)"^ J5 ^ = 
1,2, where yj[j](2) contains the elements of ji^t] corresponding to the zero off-diagonals' 
elements of 4>t[f\, and yj[f](l) contains the rest. We also define, for t = 2, ■ ■ ■ 

n 

vt = eityi[t], e^t = yu - y\t]4>\t]. = diag(M;^6^^, • • • , w'^^, w^i), 

i=l 

^nt = - ,w'^lf, St = ikt~b^t/UbJ\,--- Jt,t^2/\\i2\\,kt~l/\\il\\f, 

where b^t = min(t - w'^j = ^njiPn - fY^''^ ■ Also, Vj(j),w„t(j) for j = 1,2 are 

defined similar to yj[t](j); 4>t\t\{3) ^iid 4^t[t]ij) J = are defined similarly also. 

For to be block sign-consistent, we need only to show that equation flA.ip is true for 
j = 1, ■ ■ ■ ,kn, equation flA.2p is true for j = /c„ + 1, ■ ■ ■ ,p„ — 1, and \4>t[t]i^) — 0t[t](l)| < 
|0^[j](l)|. It is sufficient to show that the following conditions occur with probability 
going to 1 (this is similar to Zhou and Yu (2006) Proposition 1; see their paper for more 
details): 

|C,llv,(l)| < nV2|0O (1)1 _ X^n~'/'CtAW^tSt/2, 

(A.3) 

|a2iC;i\v,(l) - v,(2)| < Xnn~'/\wnr{2) - |a2iC;AW„,s,|)/2, 

where t = 2, ■ ■ ■ and r = kn + 2, ■ ■ ■ ,pn. Since the matrix Cm has size at most kn 
and kn/n = o(l), Ctn is almost surely invertible as ^ cxd by Lemma [1] and condition 
(B). In more compact form, it can be written as 

(A.4) 

|G2iGr/(2)z(2) - ^1 < Xnn-'/\wn - |G2iGr/(2)W„(2)s(2)|)/2, 
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where 





= diag(C2ii, ■ ■ ■ ,Cp„ii), 


G21 — 


diag(C(fc„+2)2i, 


■ ■ ; Cp„2l), 


Gn(2) 


= diag(C(fc„+2)ii, ■ ■ ■ , Cp^ii), 


Z = 


(V2(lf,-- - ,Vp 


.(Iff, 


z(2) 




Z = 


K„+2(2)^,--- 


,vp„(2))^, 




= (4,(ir,---,c,„](i)T, 


W = 


diag(W„2, ■ ■ ■ , 


"^npn ) 5 


W„(2) 


= diag(W„(fc„+2), • ■ • , W„p„), 


S = 


(^2 ' ■ ■ ■ ' ^p„) 5 




s(2) 




Wn = 


(w„(fe„+2)(2)^, ■ 


■ ■ ,w„p„(2) 



Step 3. Denote by An and respectively the events that the first and the second 
conditions of flA.4p hold. It is sufficient to show -P(A^) — > and -P(-B^) ^ as n ^ 00. 



Step 3.1 Showing P{A'^) 0. Define 77 = G^/z, and ry^ = G]^/z„, where 

Zn = (^n,j)J>l with Znj = n^^^"^ Yl'i=l yir(^it'^{\y^r\,\eu\<ain)} , a truncatcd VCrsion of Zj = 

n~^/'^Yl^=iyir^it for some r, t with max(l,t — kn) < r < t. Denote by r]n,j the j-th 
element of rj^- In these definitions, a{n) — 00 as ?i — ^ 00. 

We need the following result, which will be shown in Step 5: 

E{max\r]n,j\) = OUkJogPnY^^a^in)). (A.5) 

j 

Since the initial estimator ^^f-* in fl2.6p is block zero-consistent, if A„ is chosen to 
satisfy condition (D), then 7^ in Definition [1] can be set to this A„,. It is easy to see that 

P{wtj = Vj e Jno) 1, Piwtj = 0, Vj G Jni) ^1 as n ^ 00. (A.6) 

By definition, 77^ - 77 almost surely as n ^ 00. Thus, l{max, |,;„,,i>ni/2d„,} - 
l{max, |,;,|>ni/2d„i} ^ almost surely, implying 

P(max |?7nj| > n^^'^dni) — P(max \r]j\ > n^^'^dni) as n —* oo. (A. 7) 
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Then by the Markov inequahty and flA.Sp . 



P(max|?7„j| >n^^'^dni) < E(max\r]„J) / (n^^'^dni) 
j j 

= 0{{K\ogpnY/'a\n)/{n'/'dni)) ^ 0, 

by condition (C) and for a{n) chosen to go to infinity slow enough. Hence by flA.7p . we 
have P(maXj \r]j\ > n^/'^dni) 0, thus 

P(A^) < P{Al n {<• = 0, Vj e J„i}) + P«. > 0, Vj G Jni) 

< P(max |r/,| > n^^^d^^) + P{w'^^ > 0, Vj G J„i) 0, 
j 

using (1A.6P and the fact that 

n {<• = Vj G J„i} = {|Gr/z| > n^/Vnil) C { max |r/,| > n^/'ci^i}. 

3.2 Showing P{Bl) ^ 0. Define C = G2iGn^(2)z(2), then Cj = {Ct2iC~^^\t{l)), 
for some t, r with t > kn + 2. Also, define x^a: = n""*^/^ X^^Li VirUik, and the truncated 
version (by a(n)) similar to z„,j in Step 3.1. Then we can rewrite Q = n~'^^'^^j^Xrkf]ki 
and define 

k 

for some r. The summation involves at most kn terms. 

We need the following results, which will be shown in Step 4 and 6 respectively: 

E(max |z„,,|) = 0((logpO'/'«'H), (A.8) 

k 

E{max\Cn,j\) = 0{kl\ogpna\n)). (A.9) 
j 

By definition, for all j, (n,j ~ ^ and Znj — Zj ^ almost surely, implying 
P(max \(n,j — Zn,k\ > A„n^^^/2) — P(max \rij — Zk\ > A„n^/^/2) — as — > oo. (A. 10) 

j^k ' ' j,k 
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Then by the Markov inequahty, flA.81) and flA.9|) 



P(max \Cn,j - Zn,k\ > A„ni/V2) < 2{E(max |C„j|) + ^(max \zk\)} / {Xnu'^'') 

j,k j k 

which goes to by condition (C), for a{n) chosen to go to infinity slow enough. This 
imphes P(maXj-fc IQ - Zk\ > Kn^'^ jl) ^ by flAlOl) . 

Define = {w'^j = n\/] e J„o} n {u;^^- = Vj G J„i}, so that P{Dl) ^ by ^KM- 
Hence using B'r^ n = {|C - z| > A„n^/^/2} C { max^-fc \Q - Zk\ > Ann^/^/2}, 

p{b:) < p{b: n D„) + p{d:) 

< P(max 10 -Zk\> Xnu'^V^) + P(D^) ^ 0. 



Step 4. Proof of ^A.8\) . This requires the apphcation of Orhcz norm of a random 



variable X, which is defined as ||^||^ = inf{C > : EiIj{\X\/C) < 1}, where ip is a 
non- decreasing convex function with ip{0) = 0. We define ipa{x) = exp(x") — 1 for a > 1, 
which is non-decreasing and convex with ipai^) = 0. See section 2.2 of van der Vaart and 
Wellner (2000) (hereafter VW(2000)) for more details. 
We need four more general results on Orlicz norm: 

1. By Proposition A. 1.6 of VW (2000), for any independent zero-mean r.v.'s Wi, 
define Sn = Y17=i^iy then 

\\Sn\U, < Ki{E\Sn\ + II max \Wi\\\^,), (A.ll) 

l<i<n 
n 

i=l 

where Ki and K2 are constants independent of n and other indices. 

2. By Lemma 2.2.2 of VW (2000), for any r.v.'s Wj and a > 1, 

II max W,\\^^ < max ||iy,||^,(log(m + 1))^/'^ (A.13) 
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for some constant Ka depending on a only. 

3. For any r.v.'s and a > 1, (see page 105, Q.8 of VW (2000)) 

E{ max \Wi\) < (log(m + 1))^/" max \\W,\\^^. (A.14) 

l<i<m l<i<m 

4. For any r.v. W and a > 1, 



Since the {yjr^jt)/s are i.i.d. with mean zero (variance bounded by CFyj^a^j^ by con- 
dition (E)), by (jAll, 

max||2„,j||^, < ma.^ K2{{E zl A^'^ + n-^l^{n\\a^{n)\\lVl^) 
j j 

< maxK2{ayMa,M + 0{a^in))) = Oia^n)). (A.16) 
j 

Then using f[06D and ([Oil) . 

£'(max \zn,j\) < (log(p„/c„ + 1))^^^ max ||2„,j||^2 
j j 

= 0((logp„)'/'«'W), 

which is the inequahty (lA.Sp . 



Step 5. Proof of M.5]) . By Lemma[T]and condition (B), the eigenvalues < r^i < 



Tt2 < ■ ■ ■ < Ttkn < oo of Ctn are uniformly bounded away from (by 1/r) and oo (by r) 
almost surely when n ^ oo. Then ||C(ii||, ||C^J^|| < r almost surely as n — oo. Hence 
for large enough n, 

<, = l|erC,llv„,,(l)f <r^||v„,(l)f, 

for some k and t, where is the unit vector having the k-th position equals to one and 
zero elsewhere. The vector v„^t(l) is the truncated version of vt(l) containing elements 
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Zn,i. Then by flXlSl) and flXl6l) . 



max ||?7„j||^2 = max \\ril J if < rmax 



1/2 



< r^y^ max 11^2 ^ r/c^^ max ||2„,i||v-2 

= 0(A;y2a2(n)). (A.17) 

With this, using (]A.14p . we will arrive at (lA.Sp . 



Step 6. Proof of M.Pj) . Since the yirVikS are i.i.d. for each r and k with mean 



c"rfco < cr^A/ (variance bounded by 0"^^^ for r k), arguments similar to that for (1A.16P 
applies and hence 

max\\xn,rk\\ip2 = Oia^in)). (A. 18) 

r,k 



Hence we can use flA.lSp . flA.lSD . flA.lTp and flA.18p to show that 



max||CnjlUi <n ^/^/c„max || max(x^^fc,?7^ fc)||^i 

j r,k ' 

< n~^/'^knKi log3max(||a;„,^fc||i, ||?7„,fc||l) 

r,k 

= 0{n-^'^kla\n)). (A.19) 
With this, using ( ]A.14p . we will arrive at ( lA.Qp . 



Step 7. Proving M . ^)) occurs with probability going to 1 for kn = 0. When kn = 0, 



So is diagonal, and we only need to prove (]A.2p occurs with probability going to 1. Then 



we need to prove (see Step 3.2 for definition of Xkj) P(maxfc<j \xkj\ < XnWnj / {'^^'^ )) ~^ 1- 
In fact by flA.6p . we only need to prove P(maxfc<j \xkj\ > A„n^/^/2) — > 0, which follows 
from flA.18P and flA.14p and arguments similar to flA.Tp or flA.lOp . 



P(jmix\xn,kj\ > A„ra^/V2) < 2-E(max |x„,fcj|)/(A„ra^/^) 

k<.j fc<J 
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by condition (C) and a{n) chosen to go to infinity slow enough. This completes the proof 
of Theorem [2](i). □ 



Proof of Theorem [3l We focus on — r2o||oo first, which amounts to finding 

/ = P(max \ujij — Uijol > ^n)) (A. 20) 

for some t„ > 0. 

Note that = J2l=i'^ro^r,i'Pr,j with 0j j = —1 and 0jj = for ? < j. We write 
a)jj — uJijQ = Ji + ■ ■ ■ + /§, where (J5 to /§ are omitted since they have orders smaller than 
either of Ji to I4 under block sign-consistency) 

Pn Pn 



k=l k=l 

Pn Pn 

k=l k=l 



and alQ = n ^ XlLi ^Ik = n ^ ELid/ifc - yI[k](Pk[k]f ■ Then, the probability / in (!A.20p 
can be decomposed as 

8 

I </ arP{iaax\Ir\ > 5tn), 

r=l 

where and 6 are absolute constants independent of n. 



Step 1. Proving the convergence results. The proof consists of finding the orders of 
maxjj to maxjj I/4I. We will show in Step 2 that when /c„ > 0, 

max \In,3\ = OpiiiK + If hgpn/ny/^), (A.21) 

which has the highest order among the four. When kn = 0, P(/3 = 0) ^ 1 by block 
sign-consistency, and maxj j I/2I has order dominating the four. In general, we will show 



in Step 4 that 



max 1 4,2 1 = Op{{kn + mogpn/nf/^). (A.22) 
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Hence 



\n - rioIlL = max^LJij -UijoY = Op((A;„ + 1)^ logp„/ra). 



For II 12 — Oq II, using the inequality ||M|| < maxj |mjj| for a symmetric matrix M (see 
e.g. Bickel and Levina (2004)), we immediately have 

\\n-no\\=Opiikn + i)\\n-no\\oo), 

where we used the block sign-consistency and the fact that f2o has kn number of non-zero 
off-diagonals. 



Step 2. Proving II A. 21]) By the symmetry of J3 and J4, we only need to consider 
max, J IJ3I. 

Step 2.1 Defining In;3- By block sign-consistency of £1 ■ ■ -ikn are non-zero 
with probability going to 1 and flA.ip is valid for j = 1, ■ ■ ■ ,kn- Then we can rewrite 



flATTj) into 

Cai(0i[i](l) - 03i](l)) = n-'/\t{l) - XnWntSt - Cii20i[i] (2), (A.23) 

for t = 2, - ■ ■ ,pn- Block sign-consistency implies 0j[j](2) = with probability going to 1. 
Also by flA.6l) . W„t = with probability going to 1. Hence 

0,[,](1) - 0°,j(l) = n~'/'C;,\Ml) + op(l), 

where almost sure invertibility of Cm follows from Lemma[T]and condition (B) as n — > 00. 
This implies that, for j = 1, ■ ■ ■ , kn (note I3 = = when kn = 0) and t = 2, ■ ■ ■ ,pn, 

^t,t-j - <Pl-j = n~^'\k + op(l), (A.24) 
for some /c, where rj is defined in Step 3.1 in the previous proof. Then we can write as 

Pn 

k=l 
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for some intergers ■■ ,ip„. Note that has at most [k^ + 1) terms in the above 
summation. We define 

Pn 

In,3 = n-'/'J2''koVn,jl,, (A.25) 

k=l 

where rj^^if, is defined in Step 3.1 of the previous proof. 

Step 2.2 Finding the order o/maxj j I/3I. Under conditions (A) and (E), <J~^Q(t)^i is 



bounded above uniformly for all i and k. Then using ( 1A.17I) and ( ]A.14I) 



P(max|/n,3| > 5tn) < E{maji\In,'i\) / {Stn) 



< n ^/2(logp,,)^/^(/i;„ + l)max{afcoVfc,J|^n,iJU2}/(5t. 
0({(A;„ + lf{\ogPr:)Y/^a\n)/{n"hn)). 



This shows that maxjj |/„.3| = Op{{{kn + 1)'^ logp„/n}^/^), which is also the order of 
maxjj IJ3I, since maxjj |/„,,3 — /3I almost surely, and a{n) goes to infinity at arbitrary 
speed. 

Step 3. Showing Ii = op(/2). By block sign-consistency, 4)^^{2) = with 
probability going to 1 for k = 2, - ■ ■ Hence 



al = n 



"'$^(ya.-yJ,]0fcM(l))' + op(l) 

i=l 

= (T^o - 2n"^/Vfc(l)'^Ufc[fc](l) + Ufc[fc](l)^CfciiUfe[fc](l) + op(l), 
where Uk[k]{l) = 0fc[fc](l) - 0°[fc](l)- This implies that 

< 2n"V20p(A;y2) . 0^(^1/2^-1/2) ^ rOpikJn) = Op{kJn), 

where r is an almost sure upper bound for the eigenvalues of Cfcii by Lemma [1] and 
condition (B). The order for ||vyfc(l)|| can be obtained using ordinary CLT. The order for 
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||ufc[fc](l)|| can be obtained by observing (ptj — (ptj = ^^'^^J^tii^ti^) + op(l), and by 
conditioning on yi[t] for alH = 1, ■ ■ ■ , n, 

var(n-i/2eJC,-ilv,(l)) = n-iE(eJC,llv,(l)v,(l)^C,lle,) 

Hence the delta method shows that (t^^ — a^^ = Op{kn/n). 

On the other hand, by the ordinary CLT, we can easily see that ^Iq — cIq = Op{n~^/'^). 
Thus I2 has a larger order than Ji since {kn/n)/n~^/'^ = knn~^^'^ = o(l). Hence we only 
need to consider P^hl > ^in) and ignore -P(|/i| > 5tn)- 



Step 4. Proving IIA.2^) . Delta method implies ct^q^ — cr^.Q^ = —cTkoi^lo ~ '^ko)i^ + 
op(l)). We then have 

Pn ^ n 
k=l ^ r=l ^ 

which is a sum of at most A;„ + 1 terms (corr. i = j) of i.i.d. zero mean r.v.'s having 
uniformly bounded variance (fourth-moment of e^fc) by condition (A). Now define 

Pn 



k=l ^ r=l ^ 

and using (1A.14P and arguments similar to proving (lA.16p . 



P(max|/„,^2| > Stn) < E{max\In,2\)/{Stn) 

= Oiikn + l)(logpJn)i/2^(n)/t„). 

Hence this shows that, by maxj |/„^2 " -^2! ^0 almost surely, 

max I/2I = Op{{kn + l){\ogpn/ny/^). 
This completes the proof of the theorem. □ 
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Supplement: Proof of Theorems [2](ii) and [4] 
Proof of Theorem [2](ii). To prove asymptotic normality for </)„^, note that by (IA.23p . 

for OLn with II CKn II = 1 and Vn = Ctniinf^n, 

v}/^v;,'i'cxl{4>ni - €i) = h + h + h, (S.i) 

where J2 = A„(ni/„)~^/2a^G^/W„s/2 , I3 = {n/unY/'^alG^^Gi24)^2 and Ji = Un^^^a^G 
with 0„2 ^he vector of elements of 0„ corresponding to its zero off-diagonals. 

Step 1. Showing = op(l). Since -P(0„2 = 0) ^ 1, we have P{Iz = 0) ^ 1, 
thus Is = op(l). Also, we can easily show that 

I/2I < Crf ia„(r2/jV2^-i/2^j2, 

where = max{p';)^^^, (||£j'^^||) : j G Jni}- Hence if a„ = o{vn'^{nln)^^^'^k^^), we have 
I -^2 1 = op(l). The SCAD penalty ensures that a„ = for sufficiently large n if the initial 
estimator 4)^^'^ is good enough, which is measured by its block zero-consistency. 

Step 2. We write q;„, = (a^ar ■ ■ ,"np„)^, so that Ji = z^n Ej=2 <^njC7i\vj(l). 
Define 

where Ejn = £"(0^11). We can rewrite Ji = X]r=i'"^n.«' 'where 

Pn 

are independent and identically distributed with mean zero for all i. Our aim is to utilize 
the Lindeberg- Feller CLT to prove asymptotic normality of Ji, then argue that /i itself 
is distributed like Ji, thus finishing the proof. 

Step 3. Showing asymptotic normality for Ii. First, by suitably conditioning on 
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the filtration Tt = c"{ei, ■ ■ ■ , ej} generated by the sj = (ey, ■ ■ ■ , enj)'^ for j = 1, ■ ■ ■ ,t, 
we can show that (proof omitted) var(/i) = 1. 

Step 3.1 Checking the Lindeberg's condition. Next, by the Cauchy-Schwarz in- 
equality, for a fixed e > 0, 

n 

^E^lA{M>^} = ^^(W^n,,ll{l«'n.i|>e}) 
i=l 

Step 3.1.1 The Markov inequality implies that 

P{wl,>^)<e-^E{wl,)=e-^n-\ 

thus {P{wl^ > e2)}i/2 = 0(n-i/2). 

S'fej* 3.1.2 For the former expectation, note that condition (B) implies that the 
eigenvalues of Sjn are uniformly bounded away from zero and infinity as well, say by 
and c respectively, so that < c for all j. Hence 

E[^^nj^Ji\eijyiij]{l)j < c^^(max|eij|||yiy](l)||)^ ■ ||a„j||) 

3=2 ^ j=2 

<c^fc2E( max |6y|||yi[,](l)||)^ 

<c^fc^E( max 6^ -i^dlyiwll)!!'), 

where the second line used the fact that there are at most kn of the ck^^ that are non-zero 
and that Y1^Z2 ll'^njlP ~ implies ( X]^l2 ll'^nill)'^ < k"^. The third line used conditioning 
arguments and the fact that yi[p„] (1) has the largest magnitude among the yiyj (l)'s. With 
the tail assumptions for the e^j's and the ?/jj's in condition (A), the fourth moments for 
maXj.Q,^^.^o Cij and ||yi[p„](l)|| exist. Using (lA.lSP and (lA.14p . can show 

E( max ey = 0({log(fc. + 1)}^/'^), i?(||yiw(l)r) = 0(fc^(log(fc. + 1))^/'^). 
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Hence E(^Y,%2^nj^jii('ij'yi[j]i^)j = 0{k^{log'^{kn + 1))'^'''^), and combining previous 
results we have 

n 

= 0{kl{log{kn + l)r/'n~'/'u~') = 0(1) 

1=1 

by our assumption stated in the theorem. Hence Lindeberg-Feller CLT imphes that 
/i^iV(0,l). 

Step 4. Showing Ii is distributed similar to Ii. Finally, note that E{Ii — Ii) = 
and using conditioning arguments as before, we have 

Pn 

var(Ji - h) = J]a|oi?«(C-A - S-\)C,n(C-A - 

i=2 

< max a|oi?(||CT,\-S-\f -llC.-nll) 

< max a%E{\\J:-,\r ■ HS.nf ■ \\^J,?C,n^J,T - /f ■ IICjAir ■ ||C,n||). 



As discussed before, we have < c and < c. Also, the semicircular law 

implies that \\TjJ^-/'^CjiiTjJ^-/'^ ^ -^IP = Op{kn./n). We also have, almost surely, ||Cjii||, 
II Cj 11 II < T for each j = 2, ■ ■ ■ ,pn as n oo. Hence for large enough n, by condition 
(E), 

var(/i — Ji) < c^r^ max ct^q ■ 0{kn/n) = o(l), 
so that Ji = Ji + op(l), and this completes the proof. □ 

Proof of Theorem [4l The true model for = {yu, ■ ■ ■ , y„i)^ (refer to (12. 7p ) is 

y, = X,i<^°[,]i + e„ (S.2) 
for z = 2, ■ ■ ■ ,pn, where (recall that Cni = max( [i — 'jn\ , 1) 

= (yc„i, ■ ■ ■ ,yj-i), 4>i[i\i = (0j,c„i, ■ ■ ■ 
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Step 1. To show P(maxjgj,^y ll^ill/(Pn — JY^"^ > 7n) 0. We need the following 
results, the first of which will be proved in Step 3: For each j e J„o with 1 < j < ['jn\ , 

E{\\i,f/ip^-jf) = 0{n-'), (S.3) 

and, for a non-decreasing convex function with tp{0) = 0, a generalization of flA.14p . 



E{ max \Wi\) < iP~\m) max \\Wi\\^. (S.4) 

l<i<m l<i<m 

Then, with the function ip{x) = a;^ in (1S.4I) . using ( 1S.3I) . and 7„ > 0, 
P(max \\i,\\/{pn-jy^' > In) < i5(max WUV (Pn - j)') 

= E{ max MWViPn-mK 

jeJnO,l<3<hn\ 

<{[in\y/\ max mmV{Pn-jm^' 

= 0{n-^/^) 0, 

where the second line used the fact that we have set the off-diagonal bands more than 
[7nJ bands from the main diagonal to zero. 

Step 2. To show P(minjgj^^ l|-^ill/(Pn. — JY^'^ > 7n) ^ 1- We need the following 
result, which will be proved in Step 4: For j G Jni, 

Em\y{pr.-j)) = ||£,.o||V(p„ - j) + 0{n-'). (S.5) 

Then with 7„ < min^-gj^, UjoW / {Pn - jY^^, writing aj = (7„ - \\ijo\\/{Pn - jY^^Y, 

P(min \m/{Pn-jY^' > 7^ > 1 - 5^ Pm\/{Pn-jY^' < In) 

Jnl 

> 1 - E ^((ll^'^-ll - ll^.0ll)V(Pn - j) > (7n - VAIiPn-jY'')') 

^1- E 2aj\p^-jr'\\e,4'{l-{l + 0{n-\pr,-m'^' + 0{n-\p^- 
= l-0ikjn)^l, 
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where the second last hne used the delta method, with (1S.3I) showing the remainder term is 
going to zero. From Steps 1 and 2, we need to choose < 7„ < min^gj^^^ ||£jo||/(j'n — j)^^^- 

Step 3. To prove ^S. We need the following result, which can be easily 
generalized from Theorems 10.9.1, 10.9.2 and 10.9.10(1) of Graybill (2001): Let e = 
(ei, ■ ■ ■ ,em)'^, where the e^'s are i.i.d. with mean 0, and with finite second and fourth 
moments. Then for symmetric constant matrices A and B, 

E{{e^Ae){e^Be)) = atr(A)tr(5) + 6tr(A5), (S.6) 

where a and b are constants depending on the second and fourth moments of ej only. 

The estimator T, obtained from a series of linear regressions introduced in the theo- 
rem, has rows such that by (]S.2[) . 

Using (lS.2p . for j G Jno and 1 < j < [7^] , it is easy to see that 

Pn 

i=j+i 

Pn 

i=j+l 

where Ai = Xj(XfXj)~^er. ^.e^ (XfXj)~^Xf , and rjj is some constant depending on i 
and j. With this notation, we have 

\\ij\\ViPn-jr = iPn-jr' Y i^rArer)ielA,e,). 

r,k=j+l 

It is then sufficient to show that E{{eJ Arer){e1 AkSk)) = 0{n~'^) for each r > k. Let 
J-'i^i = cr{ei, ■ ■ ■ , ej_i} be the sigma algebra generated by the for 1 < A; < z — 1. For 
large enough n, we have by Lemma[T]and condition (B), for some constant independent 
of n, and for each i = j + 1, ■ ■ ■ , pn, 

tr(A,) = e^^.(XfX,)-^e,,,, < B,n-\ (S.7) 
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Step 3.1 To show E((e^A^e^)(e^v4fcefc)) = 0{n~^) for r > k. Hence for r > k with 
large enough n, using flS.7p . 

E{{ejArer){elAkek)) = E(ejAfcefcE^^„,(e^A,e,)) = E{elAkekCr%tT{Ar)) 

< Sy^Mn-' = 0{n-'). 



Step 3.2 To show E{{e^ A^erY) = 0{n ^). Using flS.6p . with constants a and b 



uniformly bounded by condition (A) and condition (E), it is sufficient to show that for 
large enough n, tr2(A.) and tT{Al) are 0{n~^). By (ETj) we have tr^{Ar) = 0{n-^). Also, 

tr(A^) = (e^„(X^X.)-V.,,)^<i?>-^ 
for large enough n, by (]S.7p . 



Step 4. To prove ^S.5\) . For j G J„,i and large enough 



n, 



Pn 



(T,\^maxi?(tr(Aj)) 
<ll^.o|lV(Pn-j) + 0(n-i), 

where the last line used (IS.Tp . This completes the proof of the theorem. □ 
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